A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include a change of plans and scheduling conflicts. Cancelling is often made easier by free or low-cost cancellation options, which benefit guests but are a less desirable, revenue-diminishing factor for hotels. Losses are particularly high on last-minute cancellations.
Online booking channels have dramatically changed customers' booking options and behavior. This adds a further dimension to the challenge of handling cancellations, which is no longer limited to traditional booking and guest characteristics.
Booking cancellations impact a hotel on several fronts.
Objective

The increasing number of cancellations calls for a machine-learning-based solution that can predict which bookings are likely to be canceled. INN Hotels Group, a hotel chain in Portugal, is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the provided data to find which factors strongly influence booking cancellations, build a predictive model that can flag bookings likely to be canceled in advance, and help formulate profitable cancellation and refund policies.
Booking_ID: the unique identifier of each booking
no_of_adults: Number of adults
no_of_children: Number of Children
no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
no_of_week_nights: Number of weeknights (Monday to Friday) the guest stayed or booked to stay at the hotel
type_of_meal_plan: Type of meal plan booked by the customer: Not Selected – no meal plan; Meal Plan 1 – breakfast; Meal Plan 2 – half board (breakfast and one other meal); Meal Plan 3 – full board (breakfast, lunch, and dinner)
required_car_parking_space: Does the customer require a car parking space? (0 - No, 1- Yes)
room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels Group
lead_time: Number of days between the date of booking and the arrival date
arrival_year: Year of arrival date
arrival_month: Month of arrival date
arrival_date: Date of the month
market_segment_type: Market segment designation.
repeated_guest: Is the customer a repeated guest? (0 - No, 1- Yes)
no_of_previous_cancellations: Number of previous bookings that were canceled by the customer before the current booking
no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer before the current booking
avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
booking_status: Flag indicating if the booking was canceled or not.
#importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # matplotlib.pyplot plots data
%matplotlib inline
import seaborn as sns
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
from sklearn.svm import SVC
#Preprocessing
from sklearn.preprocessing import MinMaxScaler
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
)
# let colab access my google drive
from google.colab import drive
drive.mount('/content/drive')
#Reading the provided dataset
data=pd.read_csv('/content/INNHotelsGroup.csv')
data.head()
# copying data to another variable to avoid any changes to original data
df1 = data.copy()
df1.head()
df1.shape
There are 36,275 rows and 19 columns in the given dataset.
df1.info()
Observations
5 columns are of object type and 14 columns are of numerical type (19 columns in total).
# finding the number of missing values
df1.isnull().sum()
There are no missing values in the given dataset
df1.duplicated().sum()
df1.isna().sum()
df1.describe()
# Summarize categorical variables
categorical_columns = data.select_dtypes(include=['object']).columns
for col in categorical_columns:
    print(f"Value counts for {col}:\n")
    print(data[col].value_counts())
# Univariate Analysis for Numerical Columns
numerical_columns = data.select_dtypes(include=['int64', 'float64']).columns
for col in numerical_columns:
    plt.figure(figsize=(8, 5))
    sns.histplot(data[col], kde=True, bins=30, color='skyblue')
    plt.title(f"Distribution of {col}", fontsize=14)
    plt.xlabel(col, fontsize=12)
    plt.ylabel("Frequency", fontsize=12)
    plt.show()
categorical_columns = data.select_dtypes(include=['object']).columns
for col in categorical_columns:
    # Limit the number of categories to avoid overloading
    top_n_categories = data[col].value_counts().nlargest(10).index  # Adjust 10 to the desired number of categories
    plt.figure(figsize=(8, 5))
    sns.countplot(data=df1, x=col, order=top_n_categories, hue=col, legend=False)
    plt.title(f"Distribution of {col}", fontsize=14)
    plt.xlabel(col, fontsize=12)
    plt.ylabel("Count", fontsize=12)
    plt.xticks(rotation=45)
    plt.show()
# Correlation heatmap for numerical features
numerical_features = df1.select_dtypes(include=['int64', 'float64']).columns
plt.figure(figsize=(10, 6))
correlation_matrix = df1[numerical_features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', cbar=True)
plt.title('Correlation Heatmap', fontsize=14)
plt.show()
# Boxplot for numerical features vs target variable
target_column = 'booking_status'
for feature in numerical_features:
    if feature != target_column:
        plt.figure(figsize=(8, 4))
        sns.boxplot(data=df1, x=target_column, y=feature, palette="viridis")
        plt.title(f'{feature} vs {target_column}', fontsize=14)
        plt.xlabel(target_column, fontsize=12)
        plt.ylabel(feature, fontsize=12)
        plt.show()
# Select categorical variables (excluding 'Booking_ID' as it's unique to each row)
categorical_vars = ['type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved',
'market_segment_type', 'repeated_guest']
# Create a subplot for each categorical variable
fig, axes = plt.subplots(len(categorical_vars), 1, figsize=(10, 25))
# Bivariate analysis for each categorical variable
for i, var in enumerate(categorical_vars):
    # Calculate percentage distribution of booking_status for each category
    cat_analysis = data.groupby(var)['booking_status'].value_counts(normalize=True).unstack() * 100
    # Plot the data as a stacked bar plot
    cat_analysis.plot(kind='bar', stacked=True, ax=axes[i], colormap='viridis', alpha=0.8)
    axes[i].set_title(f'Bivariate Analysis: {var} vs Booking Status', fontsize=14)
    axes[i].set_ylabel('Percentage (%)')
    axes[i].set_xlabel(var)
    axes[i].legend(title='Booking Status')
    axes[i].grid(axis='y', linestyle='--', alpha=0.7)
# Adjust layout and show the plots
plt.tight_layout()
plt.show()
# Calculate the total number of bookings per month
busiest_months = data.groupby('arrival_month').size().sort_values(ascending=False)
# Map month numbers to names for better readability
month_names = {
1: "January", 2: "February", 3: "March", 4: "April", 5: "May",
6: "June", 7: "July", 8: "August", 9: "September", 10: "October",
11: "November", 12: "December"
}
busiest_months.index = busiest_months.index.map(month_names)
busiest_months
The busiest months for the hotel, ranked by the total number of bookings, are:
# Calculate the distribution of bookings by market segment
market_segment_distribution = data['market_segment_type'].value_counts()
# Display the distribution
market_segment_distribution
# Calculate the average room price for each market segment
room_price_by_segment = data.groupby('market_segment_type')['avg_price_per_room'].mean().sort_values(ascending=False)
# Display the results
room_price_by_segment
# Calculate the percentage of canceled bookings
cancellation_rate = (data['booking_status'].value_counts(normalize=True)['Canceled'] * 100)
# Display the result
cancellation_rate
Approximately 32.76% of bookings are canceled.
# Filter data for repeating guests
repeating_guests = data[data['repeated_guest'] == 1]
# Calculate the percentage of cancellations among repeating guests
repeating_guest_cancellation_rate = (repeating_guests['booking_status'].value_counts(normalize=True)['Canceled'] * 100)
# Display the result
repeating_guest_cancellation_rate
Approximately 1.72% of repeating guests cancel their bookings.
# Analyze the relationship between special requests and booking cancellations
special_requests_cancellation_rate = data.groupby('no_of_special_requests')['booking_status'].value_counts(normalize=True).unstack()['Canceled'] * 100
# Display the cancellation rates by the number of special requests
special_requests_cancellation_rate
Guests with no special requests are far more likely to cancel than those with one or more requests, so special requests may indicate a stronger commitment to the booking.
sns.pairplot(df1,diag_kind='kde')
We will use 70% of data for training and 30% for testing.
# Encode categorical variables using one-hot encoding
data_encoded = pd.get_dummies(data, columns=[
'type_of_meal_plan', 'room_type_reserved', 'market_segment_type'
], drop_first=True)
# Define features (X) and target (y)
X = data_encoded.drop(columns=['Booking_ID', 'booking_status'])
y = data_encoded['booking_status'].map({'Not_Canceled': 0, 'Canceled': 1}) # Encode target as 0/1
# Split the data into training and testing sets (70-30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# Output the shapes of the resulting datasets
print(f"Training Features Shape: {X_train.shape}")
print(f"Testing Features Shape: {X_test.shape}")
print(f"Training Labels Shape: {y_train.shape}")
print(f"Testing Labels Shape: {y_test.shape}")
print("{0:0.2f}% data is in training set".format((len(X_train)/len(df1.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_test)/len(df1.index)) * 100))
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Encode categorical variables using one-hot encoding
data_encoded = pd.get_dummies(df1, columns=[
'type_of_meal_plan', 'room_type_reserved', 'market_segment_type'
], drop_first=True)
# Define features (X) and target (y)
X = data_encoded.drop(columns=['Booking_ID', 'booking_status'])
y = data_encoded['booking_status'].map({'Not_Canceled': 0, 'Canceled': 1}) # Encode target as 0/1
# Split the data into training and testing sets (70-30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# Perform Logistic Regression
log_reg = LogisticRegression(solver="newton-cg", random_state=1, max_iter=1000)  # higher max_iter to ensure convergence
log_reg.fit(X_train, y_train)
# Make predictions
y_pred = log_reg.predict(X_test)
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred))
model_score = log_reg.score(X_train, y_train)
print(model_score)
model_score = log_reg.score(X_test, y_test)
print(model_score)
print(classification_report(y_test, y_pred))
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# instantiate learning model (k = 3)
knn_3 = KNeighborsClassifier(n_neighbors = 3)
# fitting the model
knn_3.fit(X_train, y_train)
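One caveat: KNN is distance-based, and the features here mix scales (lead_time in days, binary flags, prices in euros), so unscaled distances can be dominated by the large-range columns. MinMaxScaler is imported above but never applied; a hedged sketch of folding it in via a Pipeline (shown on synthetic data, since the real `X_train` isn't reproduced here):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train with deliberately mixed scales;
# in the notebook, fit the pipeline on the real training split instead.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5)) * [1, 10, 100, 1, 1]
y_demo = (X_demo[:, 0] + X_demo[:, 2] / 100 > 0).astype(int)

knn_scaled = Pipeline([
    ("scale", MinMaxScaler()),                    # squashes every feature into [0, 1]
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
knn_scaled.fit(X_demo, y_demo)
train_acc = knn_scaled.score(X_demo, y_demo)
print(train_acc)
```

Wrapping the scaler in a Pipeline also guarantees the test split is transformed with statistics learned from the training data only.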
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
confusion_matrix_sklearn(knn_3, X_train, y_train)
knn_perf_train_3 = model_performance_classification_sklearn(
knn_3, X_train, y_train
)
knn_perf_train_3
confusion_matrix_sklearn(knn_3, X_test, y_test)
knn_perf_test_3 = model_performance_classification_sklearn(
knn_3, X_test, y_test
)
knn_perf_test_3
# creating a list of odd values of K for KNN
neighbors = [i for i in range(3,20) if i%2 != 0]
# empty list that will hold recall scores
recall_scores_train = []
recall_scores_test = []
# perform recall metrics
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # predict on train and test
    y_pred_train = knn.predict(X_train)
    y_pred_test = knn.predict(X_test)
    # evaluate recall on train and test
    recall_scores_train.append(recall_score(y_train, y_pred_train))
    recall_scores_test.append(recall_score(y_test, y_pred_test))
import matplotlib.pyplot as plt
# Plot recall scores for training and test sets
plt.figure(figsize=(8, 6))
plt.plot(neighbors, recall_scores_train, label='Training Recall', marker='o')
plt.plot(neighbors, recall_scores_test, label='Test Recall', marker='o')
plt.title('KNN Recall Scores for Different Values of K')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Recall Score')
plt.xticks(neighbors)
plt.legend()
plt.grid(True)
plt.show()
The recall scores for both training and test sets are highest at k=3, meaning the model identifies positive instances (cancellations) best on both splits at that value; note that very high training recall at small k can also be a sign of overfitting.
As k increases beyond 3, recall decreases on both sets, suggesting that larger neighborhoods smooth over the patterns that distinguish cancellations.
Based on these recall scores, k=3 appears to be the most suitable choice for capturing positive instances while still generalizing to new data.
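Since k was chosen here by inspecting test-set recall, the test split effectively leaks into model selection. A hedged alternative is to pick k by cross-validation on the training data only, e.g. with GridSearchCV (sketch on synthetic data; in the notebook you would pass `X_train`, `y_train`):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(3, 20, 2))},  # same odd-k grid as above
    scoring="recall",  # match the metric used in the plot above
    cv=5,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_["n_neighbors"])
```

The selected k then gets a single, untouched evaluation on the held-out test set.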
from sklearn.naive_bayes import GaussianNB # using Gaussian algorithm from Naive Bayes
# create the model
booking_model = GaussianNB()
booking_model.fit(X_train, y_train)  # y_train is already a 1-D Series; no ravel() needed
booking_train_predict = booking_model.predict(X_train)
from sklearn import metrics
print("Model Accuracy: {0:.4f}".format(metrics.accuracy_score(y_train, booking_train_predict)))
print()
booking_test_predict = booking_model.predict(X_test)
print("Model Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, booking_test_predict)))
print()
print("Confusion Matrix")
cm=metrics.confusion_matrix(y_test, booking_test_predict, labels=[1, 0])
df_cm = pd.DataFrame(cm, index=["1", "0"],
                     columns=["Predict 1", "Predict 0"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True)
print("Classification Report")
print(metrics.classification_report(y_test, booking_test_predict, labels=[1, 0]))
noofk = list(range(1, 20, 2))
accuracy = []
for k in noofk:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    acc = knn.score(X_test, y_test)  # test-set accuracy for this k
    accuracy.append(acc)
print(accuracy)
sns.pointplot(x=noofk, y=accuracy, ci=None)  # use errorbar=None on seaborn >= 0.12
finalmodel=KNeighborsClassifier(n_neighbors = 1)
finalmodel.fit(X_train, y_train)
print(classification_report(y_test, finalmodel.predict(X_test)))
# defining a function to compute different metrics to check performance of a classification model
def model_performance_classification(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting the class labels
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(y_true=target, y_pred=y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# fitting SVM model
svm = SVC(kernel='linear')
svm.fit(X_train,y_train)
confusion_matrix_sklearn(model=svm,predictors= X_train, target=y_train)
confusion_matrix_sklearn(model=svm,predictors= X_test, target=y_test)
print("Training performance:")
model_lin_kern = model_performance_classification(model=svm, predictors=X_train, target=y_train)
model_lin_kern
print("Test performance:")
model_lin_kern_test = model_performance_classification(model=svm, predictors=X_test, target=y_test)
model_lin_kern_test
# fitting SVM model
svm = SVC(kernel='poly',degree=2)
svm.fit(X_train,y_train)
confusion_matrix_sklearn(model=svm,predictors= X_train, target=y_train)
confusion_matrix_sklearn(model=svm,predictors= X_test, target=y_test)
print("Training performance:")
model_poly_kern = model_performance_classification(model=svm, predictors=X_train, target=y_train)
model_poly_kern
print("Test performance:")
model_poly_kern_test = model_performance_classification(model=svm, predictors=X_test, target=y_test)
model_poly_kern_test
Performance decreases with the degree-2 polynomial kernel compared to the linear kernel.
# fitting SVM model
svm = SVC(kernel='poly',degree=3)
svm.fit(X_train,y_train)
confusion_matrix_sklearn(model=svm,predictors= X_train, target=y_train)
confusion_matrix_sklearn(model=svm,predictors= X_test, target=y_test)
print("Training performance:")
model_poly_kern_3 = model_performance_classification(model=svm, predictors=X_train, target=y_train)
model_poly_kern_3
print("Test performance:")
model_poly_kern_3_test = model_performance_classification(model=svm, predictors=X_test, target=y_test)
model_poly_kern_3_test
Performance does not change when the polynomial degree is increased from 2 to 3.
# fitting SVM model
svm = SVC(kernel='rbf')
svm.fit(X_train,y_train)
confusion_matrix_sklearn(model=svm,predictors= X_train, target=y_train)
confusion_matrix_sklearn(model=svm,predictors= X_test, target=y_test)
print("Training performance:")
model_rbf_kern = model_performance_classification(model=svm, predictors=X_train, target=y_train)
model_rbf_kern
print("Test performance:")
model_rbf_kern_test = model_performance_classification(model=svm, predictors=X_test, target=y_test)
model_rbf_kern_test
svm._gamma  # resolved gamma value (private attribute; svm.gamma still holds the "scale" setting)
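`_gamma` is a private attribute, but its value is easy to reproduce: with the default `gamma="scale"`, scikit-learn resolves gamma to 1 / (n_features * X.var()). A small sketch on synthetic data (the exact number depends on your feature matrix, and the private attribute may change across sklearn versions):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 6))
y_demo = (X_demo[:, 0] > 0).astype(int)

clf = SVC(kernel="rbf")  # gamma defaults to "scale"
clf.fit(X_demo, y_demo)

# gamma="scale" resolves to 1 / (n_features * variance of the whole matrix)
manual_gamma = 1.0 / (X_demo.shape[1] * X_demo.var())
print(np.isclose(clf._gamma, manual_gamma))
```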
#fitting SVM model
svm = SVC(kernel='poly',degree=2,gamma=0.3)
svm.fit(X_train,y_train)
confusion_matrix_sklearn(model=svm,predictors= X_train, target=y_train)
confusion_matrix_sklearn(model=svm,predictors= X_test, target=y_test)
print("Training performance:")
model_poly_gamma_1 = model_performance_classification(model=svm, predictors=X_train, target=y_train)
model_poly_gamma_1
print("Test performance:")
model_poly_gamma_1_test = model_performance_classification(model=svm, predictors=X_test, target=y_test)
model_poly_gamma_1_test
# fitting SVM model
svm = SVC(kernel='poly',degree=2,gamma=0.3,C=0.1)
svm.fit(X_train,y_train)
confusion_matrix_sklearn(model=svm,predictors= X_train, target=y_train)
confusion_matrix_sklearn(model=svm,predictors= X_test, target=y_test)
print("Training performance:")
model_poly_C_1 = model_performance_classification(model=svm, predictors=X_train, target=y_train)
model_poly_C_1
print("Test performance:")
model_poly_C_1_test = model_performance_classification(model=svm, predictors=X_test, target=y_test)
model_poly_C_1_test
# fitting SVM model
svm = SVC(kernel='poly',degree=2,gamma=0.3,C=0.05)
svm.fit(X_train,y_train)
confusion_matrix_sklearn(model=svm,predictors= X_train, target=y_train)
confusion_matrix_sklearn(model=svm,predictors= X_test, target=y_test)
print("Training performance:")
model_poly_C_2 = model_performance_classification(model=svm, predictors=X_train, target=y_train)
model_poly_C_2
print("Test performance:")
model_poly_C_2_test = model_performance_classification(model=svm, predictors=X_test, target=y_test)
model_poly_C_2_test
No improvement.
# training performance comparison
models_train_comp_df = pd.concat(
[
model_lin_kern.T,
model_poly_kern.T,
model_poly_kern_3.T,
model_rbf_kern.T,
model_poly_gamma_1.T,
model_poly_C_1.T,
model_poly_C_2.T
],
axis=1,
)
models_train_comp_df.columns = [
    "SVM-Linear Kernel (default)",
    "SVM-Polynomial Kernel, degree = 2",
    "SVM-Polynomial Kernel, degree = 3",
    "SVM-RBF Kernel",
    "SVM-Polynomial Kernel, degree = 2, gamma = 0.3",
    "SVM-Polynomial Kernel, degree = 2, gamma = 0.3, C = 0.1",
    "SVM-Polynomial Kernel, degree = 2, gamma = 0.3, C = 0.05",
]
print("Training performance comparison:")
models_train_comp_df
The linear kernel and the polynomial kernel (degree = 2, gamma = 0.3, C = 0.05) have the best balance across all metrics, with similar F1 scores (~0.677).
The linear kernel is simpler and has fewer hyperparameters, making it a good default choice.
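A practical note on that default choice: `SVC(kernel='linear')` trains slowly on ~25k rows, and `LinearSVC` fits the same kind of linear decision boundary much faster. A hedged sketch on synthetic data (not a drop-in replacement for the metric tables above, since the underlying optimization differs slightly):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 8))
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)  # linearly separable target

lin_svm = Pipeline([
    ("scale", MinMaxScaler()),                # linear SVMs also benefit from scaling
    ("svc", LinearSVC(C=1.0, max_iter=5000)),
])
lin_svm.fit(X_demo, y_demo)
acc = lin_svm.score(X_demo, y_demo)
print(round(acc, 2))
```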
# testing performance comparison
models_test_comp_df = pd.concat(
[
model_lin_kern_test.T,
model_poly_kern_test.T,
model_poly_kern_3_test.T,
model_rbf_kern_test.T,
model_poly_gamma_1_test.T,
model_poly_C_1_test.T,
model_poly_C_2_test.T
],
axis=1,
)
models_test_comp_df.columns = [
    "SVM-Linear Kernel (default)",
    "SVM-Polynomial Kernel, degree = 2",
    "SVM-Polynomial Kernel, degree = 3",
    "SVM-RBF Kernel",
    "SVM-Polynomial Kernel, degree = 2, gamma = 0.3",
    "SVM-Polynomial Kernel, degree = 2, gamma = 0.3, C = 0.1",
    "SVM-Polynomial Kernel, degree = 2, gamma = 0.3, C = 0.05",
]
print("Test set performance comparison:")
models_test_comp_df
Use the polynomial kernel with degree = 2, gamma = 0.3, and C = 0.05: it gives the best performance on the test set, balancing recall, precision, and F1 score effectively.
Conclusion: Logistic regression is a good baseline model, but its recall is lower.
Conclusion: KNN performs well in terms of recall and accuracy, but its precision is lower.
Conclusion: SVM offers the highest accuracy and precision.
Best algorithm selection:
If Overall Balance (F1-Score) is Most Important: KNN is a strong contender due to its balanced F1-Score of 0.71.
If Accuracy is Most Important: SVM narrowly edges out KNN with an accuracy of 81.07% compared to KNN’s 81%.
Choose KNN if recall is your priority.
Choose SVM if you need the highest accuracy and precision or a balanced performance.
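If recall on cancellations is the business priority (a missed cancellation costs an empty room), one lever that works with either model is class weighting. A hedged sketch with `class_weight='balanced'` on synthetic, mildly imbalanced data (roughly matching the ~33% cancellation rate observed above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(600, 5))
# ~30% positives, loosely mirroring the observed cancellation rate
y_demo = (X_demo[:, 0] + rng.normal(scale=1.5, size=600) > 0.8).astype(int)

plain = LogisticRegression().fit(X_demo, y_demo)
weighted = LogisticRegression(class_weight="balanced").fit(X_demo, y_demo)

r_plain = recall_score(y_demo, plain.predict(X_demo))
r_weighted = recall_score(y_demo, weighted.predict(X_demo))
print(r_plain, r_weighted)  # weighting typically raises recall at some precision cost
```

The same `class_weight` parameter is accepted by `SVC`, so the tradeoff can be tuned without switching algorithms.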